
Week 3 lecture notes

Reviewing introductory papers on interpretability and linguistic probes inside the black box of neural language models

Attention (Transformer architecture)

Attention can stand in for multiple transformations. For the concept of self-attention, we focus on Vaswani et al. (2017)-style computations, which rely on a Query-Key-Value framework. The general idea is that we want to compute contextualized token embeddings as weighted sums of value vectors, with the weights determined by query–key compatibility.

  • Vaswani et al. (2017): “An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
  • Three weight matrices:
    • $W_Q$, $W_K$, $W_V$ for queries, keys, and values
    • $q_i$, $k_i$ are representations of a token $t_i$, obtained by multiplying its embedding $e_i$ by $W_Q$ and $W_K$, respectively (and analogously $v_i$ via $W_V$)
    • Multiplying $Q$ by $K^T$, we obtain similarity scores: a square but asymmetric matrix (sequence length × sequence length)
    • Pass the similarity scores through a softmax and multiply by $V$, which brings us back to the hidden dimension $d$ (not the vocabulary size)
  • As input to the next layer, we need something that has the hidden-state size $d$ but reweights all the $e_i$s; that is what multiplying by $V$ accomplishes (see the sketch after this list)
  • Wikipedia: The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.
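
A minimal NumPy sketch of the single-head computation described above, with toy sizes and random inputs (no multi-head splitting; variable names are illustrative, not from any particular implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8                                   # toy sequence length and hidden size
rng = np.random.default_rng(0)
E = rng.normal(size=(n, d))                   # token embeddings e_i, stacked row-wise

W_Q = rng.normal(size=(d, d))                 # query projection
W_K = rng.normal(size=(d, d))                 # key projection
W_V = rng.normal(size=(d, d))                 # value projection

Q, K, V = E @ W_Q, E @ W_K, E @ W_V
scores = Q @ K.T / np.sqrt(d)                 # (n, n) similarity matrix: square but asymmetric
A = softmax(scores, axis=-1)                  # attention weights; each row sums to 1
out = A @ V                                   # (n, d): reweighted values, back at hidden size d
```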

Quick demo of Transformers

  • extracting hidden states, self-attention weights (often just called “attentions”), and intermediate predictions
  • using the Hugging Face transformers library with BERT (see the sketch below)
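
A minimal sketch of the demo idea with the Hugging Face transformers library; the model name and example sentence are arbitrary choices, not necessarily the ones used in class. Intermediate (logit-lens-style) predictions would additionally require applying the masked-language-model head to each hidden state, which is omitted here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,    # return the embedding layer plus every encoder layer
    output_attentions=True,       # return the self-attention weights of every layer
)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # tuple of 13 tensors (embeddings + 12 layers), each (1, seq_len, 768)
attentions = outputs.attentions         # tuple of 12 tensors, each (1, num_heads, seq_len, seq_len)
print(len(hidden_states), hidden_states[-1].shape)
print(len(attentions), attentions[-1].shape)
```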

Week 2 Readings: Hidden states

Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4593-4601). https://aclanthology.org/P19-1452/
  • Probing procedure described in Tenney et al. (2019, ICLR)
    • Formally, we represent a sentence as a list of tokens $T = [t_0, t_1, \ldots, t_n]$, and a labeled edge as $\{s^{(1)}, s^{(2)}, L\}$. We treat $s^{(1)} = [i^{(1)}, j^{(1)})$ and, optionally, $s^{(2)} = [i^{(2)}, j^{(2)})$ as (end-exclusive) spans. For unary edges such as constituent labels, $s^{(2)}$ is omitted. We take $L$ to be a set of zero or more targets from a task-specific label set $\mathcal{L}$.
    • The model is designed to have limited expressive power on its own, so as to focus on what information can be extracted from the contextual embeddings. We take a list of contextual vectors $[e_0, e_1, \ldots, e_n]$ and integer spans $s^{(1)} = [i^{(1)}, j^{(1)})$ and (optionally) $s^{(2)} = [i^{(2)}, j^{(2)})$ as inputs.
  • Scalar mixing weights learn a weighted sum of the layers, with a separate mix for each task (see the sketch at the end of this entry)
    • To pool across layers, we use the scalar mixing technique introduced by the ELMo model. Following Equation (1) of Peters et al. (2018a), for each task $\tau$ we introduce scalar parameters $\gamma_\tau$ and $a^{(0)}_\tau, a^{(1)}_\tau, \ldots, a^{(L)}_\tau$, and let $\mathbf{h}_{i,\tau} = \gamma_\tau \sum_{l=0}^{L} s^{(l)}_\tau \mathbf{h}^{(l)}_i$, where $s_\tau = \mathrm{softmax}(a_\tau)$.
  • Cumulative scoring allows a probing model to combine predictions from multiple layers
    • we train a series of classifiers $\{P^{(l)}_\tau\}_l$ which use scalar mixing (Eq. 1) to attend to layer $l$ as well as all previous layers. $P^{(0)}_\tau$ corresponds to a non-contextual baseline that uses only a bag of word(piece) embeddings, while $P^{(L)}_\tau = P_\tau$ corresponds to probing all layers of the BERT model. These classifiers are cumulative, in the sense that $P^{(l+1)}_\tau$ has a similar number of parameters but with access to strictly more information than $P^{(l)}_\tau$.
    • We can then compute a differential score $\Delta^{(l)}_\tau$, which measures how much better we do on the probing task if we observe one additional encoder layer $l$: $\Delta^{(l)}_\tau = \mathrm{Score}(P^{(l)}_\tau) - \mathrm{Score}(P^{(l-1)}_\tau)$
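
A minimal PyTorch sketch of the scalar mix in Eq. (1) (task subscript $\tau$ dropped) and of the differential score; the layer count and tensor shapes are illustrative, and this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mixing: h_i = gamma * sum_l softmax(a)^(l) * h_i^(l)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(num_layers + 1))  # a^(0..L); layer 0 = embeddings
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):
        # layers: list of (batch, seq_len, d) tensors h^(0), ..., h^(L)
        s = torch.softmax(self.a, dim=0)                    # s = softmax(a)
        stacked = torch.stack(layers, dim=0)                # (L+1, batch, seq_len, d)
        return self.gamma * (s.view(-1, 1, 1, 1) * stacked).sum(dim=0)

def differential_score(score_l: float, score_l_minus_1: float) -> float:
    """Delta^(l) = Score(P^(l)) - Score(P^(l-1)): gain from seeing one more encoder layer."""
    return score_l - score_l_minus_1

# Toy usage with fake BERT-base-sized layer outputs
mix = ScalarMix(num_layers=12)
layers = [torch.randn(2, 7, 768) for _ in range(13)]
pooled = mix(layers)                                        # (2, 7, 768)
```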
Durrani, N., Sajjad, H., Dalvi, F., & Belinkov, Y. (2020, November). Analyzing Individual Neurons in Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4865-4880). https://aclanthology.org/2020.emnlp-main.395/
  • Search
    • Algorithm for selecting lambdas (regularization weights) in their Elasticnet implementation
    💡 Selecting the lambdas depends on a score that is a function of four accuracies:
    • We then compute a score for each lambda set $(\lambda_1, \lambda_2)$ as: $S(\lambda_1, \lambda_2) = \alpha(A_t - A_b) - \beta(A_z - A_l)$
    • First term
      • The first term ensures that we select a lambda set where accuracies of top and bottom neurons are further apart
      • $A_t$ is the accuracy of the classifier retaining the top neurons and masking the rest,
      • $A_b$ is the accuracy retaining the bottom neurons,
    • Second term
      • the second term ensures that we prefer weights that incur a minimal loss in classifier accuracy due to regularization.
      • $A_z$ is the accuracy of the classifier trained using all neurons but without regularization, and
      • $A_l$ is the accuracy with the current lambda set.
    • $\alpha$ and $\beta$ are both set to 0.5 in their experiments.
    • NB: ElasticNet combines L1 and L2 regularization, with a ratio between the two penalties determining how strongly the learned weights are regularized (L2 shrinks large coefficients; L1 forces many weights to exactly 0)
  • Neuron ranking algorithm
    • “We use the neuron ranking algorithm as described in Dalvi et al. (2019). Given the trained classifier $\theta \in \mathbb{R}^{D \times T}$, the algorithm extracts a ranking of the $D$ neurons in the model $M$. For each label $t$ in task $T$, the weights are sorted by their absolute values in descending order. To select the $N$ most salient neurons w.r.t. the task $T$, an iterative process is carried out … until the set reaches a specified size $N$.”
  • Minimal neuron selection
    1. “Train a classifier to predict the task using all the neurons (call it Oracle),
    2. Obtain a neuron ranking based on the ranking algorithm described above,
    3. Choose the top N neurons from the ranked list and retrain a classifier using these,
    4. Repeat step 3 by increasing the size of $N$, until the classifier obtains an accuracy close to the Oracle (within a specified threshold $\delta$).” (See the sketch below.)
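
A hypothetical sketch of the lambda-set scoring and the minimal-neuron-selection loop described above; `train_probe`, `accuracy`, `X`, and `y` are placeholder names standing in for the paper's classifier training and evaluation, not functions from its released code.

```python
def lambda_score(A_t, A_b, A_z, A_l, alpha=0.5, beta=0.5):
    """S(lambda_1, lambda_2) = alpha * (A_t - A_b) - beta * (A_z - A_l)."""
    return alpha * (A_t - A_b) - beta * (A_z - A_l)

def minimal_neurons(ranking, X, y, oracle_acc, delta, train_probe, accuracy, step=50):
    """Steps 1-4 above: grow the top-N neuron set until the probe's accuracy is
    within delta of the Oracle trained on all neurons.

    `ranking` is the list of neuron indices from the ranking algorithm;
    `X` is assumed to be an activations array of shape (samples, neurons).
    """
    n = step
    while n <= len(ranking):
        top = list(ranking[:n])                 # step 3: keep only the top-N neurons
        probe = train_probe(X[:, top], y)       # retrain a classifier on them
        if oracle_acc - accuracy(probe, X[:, top], y) <= delta:
            return top                          # close enough to the Oracle: stop
        n += step                               # step 4: increase N and repeat
    return list(ranking)                        # fallback: all neurons were needed
```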

Week 2 Readings: Self-attentions

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics. https://aclanthology.org/N19-1357/
  • “Attention weights should correlate with feature importance measures (e.g., gradient-based measures)”
  • Alternatively, shuffling the attention weights (counterfactual weightings) should disrupt the model’s predictions
    • However, note that an alternative weighting of the attentions only changes the scalar multipliers applied to the values passed to the next layer; the embeddings themselves are still passed upward (see the sketch below)
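
A toy NumPy illustration of that caveat: permuting the attention weights changes only the mixing coefficients, while the value vectors being mixed are untouched (the weights and values here are random stand-ins, not taken from a trained model).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
A = rng.dirichlet(np.ones(n), size=n)   # stand-in attention weights; each row sums to 1
V = rng.normal(size=(n, d))             # stand-in value vectors

out = A @ V                             # original weighted sum of values
A_perm = A[:, rng.permutation(n)]       # counterfactual: permute each row's weights
out_perm = A_perm @ V                   # the same value vectors are still passed upward
```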
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics. https://aclanthology.org/D19-1002/
  • Time permitting!